An Analysis of Mushroom Data from Philipps-Universität Marburg¶

Aidan Henbest
Dr. Bixler
Data Science
26 January 2023

Introduction¶

This data set includes information on the class, cap diameter, cap shape, cap surface, cap color, bruising/bleeding, gill attachment, gill spacing, gill color, stem height, stem width, stem root, stem surface, stem color, veil type, veil color, ring presence, ring type, spore print color, habitat, and season of 61,069 mushrooms. It was retrieved from the Philipps-Universität Marburg website. Although some columns are missing many values and the categories are not evenly distributed, the high total number of mushrooms still leaves many values in each category. The data set was chosen because it has an extensive number of rows and a wide variety of columns for machine learning algorithms to learn from. There are no ethical considerations for using this data set, as it does not include data on anything that would affect humans or the environment. Included here is a diagram of a basic mushroom to illustrate many of the terms referenced in this analysis:

Link to dataset: https://mushroom.mathematik.uni-marburg.de/

Initial Data Analysis¶

In [1]:
# Import statements
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from matplotlib import cm as cmap
from sklearn.preprocessing import StandardScaler
import warnings
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Filters warnings
warnings.filterwarnings('ignore')

# Allows pandas to print everything without ellipses
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Set Style
sns.set()

# Creates the data frame
df = pd.read_csv('./data/mushrooms/secondary_data_generated.csv', sep = ';')

# Shows the name of each column, number of entries in each column, and the data type of each column for each data frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61069 entries, 0 to 61068
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 61069 non-null  object 
 1   cap-diameter          61069 non-null  float64
 2   cap-shape             61069 non-null  object 
 3   cap-surface           46949 non-null  object 
 4   cap-color             61069 non-null  object 
 5   does-bruise-or-bleed  61069 non-null  object 
 6   gill-attachment       51185 non-null  object 
 7   gill-spacing          36006 non-null  object 
 8   gill-color            61069 non-null  object 
 9   stem-height           61069 non-null  float64
 10  stem-width            61069 non-null  float64
 11  stem-root             9531 non-null   object 
 12  stem-surface          22945 non-null  object 
 13  stem-color            61069 non-null  object 
 14  veil-type             3177 non-null   object 
 15  veil-color            7413 non-null   object 
 16  has-ring              61069 non-null  object 
 17  ring-type             58598 non-null  object 
 18  spore-print-color     6354 non-null   object 
 19  habitat               61069 non-null  object 
 20  season                61069 non-null  object 
dtypes: float64(3), object(18)
memory usage: 9.8+ MB
In [2]:
# Shows the memory used by each column for each data frame
df.memory_usage(deep = True)
Out[2]:
Index                       128
class                   3542002
cap-diameter             488552
cap-shape               3542002
cap-surface             3174882
cap-color               3542002
does-bruise-or-bleed    3542002
gill-attachment         3285018
gill-spacing            2890364
gill-color              3542002
stem-height              488552
stem-width               488552
stem-root               2202014
stem-surface            2550778
stem-color              3542002
veil-type               2036810
veil-color              2146946
has-ring                3542002
ring-type               3477756
spore-print-color       2119412
habitat                 3542002
season                  3542002
dtype: int64
In [3]:
# Shows the number of values missing in each column for each data frame
df.isna().sum()
Out[3]:
class                       0
cap-diameter                0
cap-shape                   0
cap-surface             14120
cap-color                   0
does-bruise-or-bleed        0
gill-attachment          9884
gill-spacing            25063
gill-color                  0
stem-height                 0
stem-width                  0
stem-root               51538
stem-surface            38124
stem-color                  0
veil-type               57892
veil-color              53656
has-ring                    0
ring-type                2471
spore-print-color       54715
habitat                     0
season                      0
dtype: int64
In [4]:
# Shows the first 10 rows of the data frame
df.head()
Out[4]:
class cap-diameter cap-shape cap-surface cap-color does-bruise-or-bleed gill-attachment gill-spacing gill-color stem-height stem-width stem-root stem-surface stem-color veil-type veil-color has-ring ring-type spore-print-color habitat season
0 p 13.83 x h e f e NaN w 18.05 18.08 s y w u w t p NaN d w
1 p 16.92 x g o f e NaN w 18.70 18.10 s y w u w t g NaN d u
2 p 15.92 x g o f e NaN w 17.86 18.65 s y w u w t g NaN d u
3 p 15.73 f g o f e NaN w 16.82 17.71 s y w u w t p NaN d u
4 p 13.84 x h o f e NaN w 18.07 18.49 s y w u w t g NaN d w
In [5]:
# Performs a basic statistical analysis on the data frame
df.describe()
Out[5]:
cap-diameter stem-height stem-width
count 61069.000000 61069.000000 61069.000000
mean 6.746893 6.588775 12.155013
std 5.262972 3.362591 9.989620
min 0.410000 0.000000 0.000000
25% 3.490000 4.640000 5.200000
50% 5.890000 5.960000 10.180000
75% 8.540000 7.760000 16.600000
max 61.580000 35.790000 100.830000
In [6]:
# Shows the number of values in each category of each non-numeric column of the data frame
for col in df.select_dtypes(include = 'object').columns:
    print(df[col].value_counts())
    print()
p    33888
e    27181
Name: class, dtype: int64

x    26802
f    13492
s     7099
b     5697
o     3472
p     2685
c     1822
Name: cap-shape, dtype: int64

t    8133
s    7577
y    6396
h    5029
g    4740
d    4440
e    2590
k    2283
i    2198
w    2151
l    1412
Name: cap-surface, dtype: int64

n    24438
y     8487
w     7592
g     4363
e     4027
o     3625
r     1804
u     1783
p     1652
b     1258
k     1223
l      817
Name: cap-color, dtype: int64

f    50479
t    10590
Name: does-bruise-or-bleed, dtype: int64

a    12676
d    10269
x     7413
p     6001
e     5648
s     5648
f     3530
Name: gill-attachment, dtype: int64

c    24710
d     7766
f     3530
Name: gill-spacing, dtype: int64

w    18617
n     9770
y     9419
p     6023
g     4087
f     3530
o     2924
k     2355
r     1385
e     1013
u     1012
b      934
Name: gill-color, dtype: int64

s    3177
b    3177
r    1412
f    1059
c     706
Name: stem-root, dtype: int64

s    5977
y    4938
i    4401
t    2657
g    1765
k    1609
f    1059
h     539
Name: stem-surface, dtype: int64

w    22943
n    18079
y     7788
g     2656
o     2221
e     2039
u     1488
f     1059
p     1034
k      831
r      532
l      239
b      160
Name: stem-color, dtype: int64

u    3177
Name: veil-type, dtype: int64

w    5485
n     525
y     516
u     353
k     353
e     181
Name: veil-color, dtype: int64

f    45890
t    15179
Name: has-ring, dtype: int64

f    48361
e     2468
z     2118
r     1425
l     1404
g     1239
p     1230
m      353
Name: ring-type, dtype: int64

k    2102
w    1237
p    1234
n    1059
g     353
r     186
u     183
Name: spore-print-color, dtype: int64

d    44348
g     7825
l     3117
m     2972
h     1963
w      353
p      350
u      141
Name: habitat, dtype: int64

a    30150
u    22857
w     5333
s     2729
Name: season, dtype: int64
In [7]:
# Creates a heatmap of the correlation in the data frame
plt.figure(figsize = (12, 10))
sns.heatmap(df.corr(), cmap = cmap.PiYG, annot = True, center = 0)
plt.title('Correlation', fontsize = 16)
plt.xlabel('Column', fontsize = 14)
plt.ylabel('Column', fontsize = 14)
In [8]:
# Creates a copy of the data frame
X = df.copy()

# Makes the categorical data in the data frame copy into numeric data
columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed',
           'gill-attachment', 'gill-spacing', 'gill-color', 'stem-root',
           'stem-surface', 'stem-color', 'veil-type', 'veil-color', 'has-ring',
           'ring-type', 'spore-print-color', 'habitat', 'season']
for i in columns:
    X[i] = pd.Categorical(X[i])
    X[i] = X[i].cat.codes

# Splits the answers off from the rest of the data frame 
y = X.pop('does-bruise-or-bleed')

# Creates the training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Scales the training and testing datasets
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Shows the first 10 rows of the data frame
X.head(10)
Out[8]:
class cap-diameter cap-shape cap-surface cap-color gill-attachment gill-spacing gill-color stem-height stem-width stem-root stem-surface stem-color veil-type veil-color has-ring ring-type spore-print-color habitat season
0 1 13.83 6 3 1 2 -1 10 18.05 18.08 4 7 11 0 4 1 5 -1 0 3
1 1 16.92 6 2 6 2 -1 10 18.70 18.10 4 7 11 0 4 1 2 -1 0 2
2 1 15.92 6 2 6 2 -1 10 17.86 18.65 4 7 11 0 4 1 2 -1 0 2
3 1 15.73 2 2 6 2 -1 10 16.82 17.71 4 7 11 0 4 1 5 -1 0 2
4 1 13.84 6 3 6 2 -1 10 18.07 18.49 4 7 11 0 4 1 2 -1 0 3
5 1 12.25 2 3 6 2 -1 10 17.42 17.63 4 7 11 0 4 1 2 -1 0 3
6 1 14.27 2 3 1 2 -1 10 18.19 17.24 4 7 11 0 4 1 5 -1 0 3
7 1 15.44 2 3 1 2 -1 10 16.80 17.47 4 7 11 0 4 1 2 -1 0 2
8 1 13.11 6 2 1 2 -1 10 16.86 17.76 4 7 11 0 4 1 5 -1 0 3
9 1 16.90 2 3 6 2 -1 10 18.55 19.13 4 7 11 0 4 1 5 -1 0 2

    Preprocessing of the data consisted of importing packages, changing settings, checking the data for flaws, performing some basic analyses, and formatting the data for machine learning. After all of the packages were imported, some basic settings were configured. Warnings were suppressed so that they do not clutter this analysis. The pandas display limits were removed so that data frames print in full rather than being truncated with ellipses. The style of the plots was also set.

    After this, the info command was used to see the data types of the columns. All of the columns hold strings except for three that hold floats: cap diameter, stem height, and stem width. Then, the memory usage command was used to show how much memory each column uses. The final step of checking for flaws was determining how many values were missing in each column. While there are many missing values, this should not severely impact the analysis because there are so many rows in total.
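    The missing-value counts can also be expressed as percentages, which makes it easier to judge whether a column (for example, veil type at about 95% missing) is usable. A minimal sketch on a hypothetical toy frame, not the real data:

```python
import pandas as pd

# Toy frame standing in for the mushroom data (hypothetical values)
toy = pd.DataFrame({
    'cap-surface': ['t', None, 's', None],
    'cap-color': ['n', 'y', 'w', 'g'],
})

# Fraction of missing entries per column, expressed as a percentage
missing_pct = toy.isna().mean() * 100
print(missing_pct)   # cap-surface: 50.0, cap-color: 0.0
```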

    Next, the head command was used to view the data frame, and it appeared as expected. The describe function was then used; however, only the three float columns could be analyzed: cap diameter, stem height, and stem width. The averages for these numeric values were 6.75 cm, 6.59 cm, and 12.16 mm, respectively. After this, a value count was performed on all of the non-numeric columns. Many of the columns are unevenly distributed; in particular, the spore print color and habitat variables had very few rows for some of their values. The final piece of analysis was a heat map of the correlations between the numeric variables. Cap diameter and stem width are strongly correlated, but stem height is only weakly correlated with both stem width and cap diameter.
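    The correlation pattern behind the heat map can be reproduced with df.corr(), which only considers numeric columns. A minimal sketch on hypothetical values chosen to mimic the finding (stem width tracks cap diameter closely, stem height only loosely):

```python
import pandas as pd

# Hypothetical numeric columns, not the real measurements
toy = pd.DataFrame({
    'cap-diameter': [2.0, 4.0, 6.0, 8.0],
    'stem-height':  [5.0, 4.0, 7.0, 6.0],
    'stem-width':   [4.0, 8.0, 12.0, 16.0],
})

# Pairwise Pearson correlations, as fed to the heat map
print(toy.corr().round(2))   # cap-diameter vs stem-width: 1.0; vs stem-height: 0.6
```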

    The last step of the initial data analysis was creating a copy of the data frame in which all of the categorical data was converted to numeric codes so it can be used in machine learning algorithms. The answer column, does-bruise-or-bleed, was then split off from the main data frame so the features and labels were separated for training. Next, the data was split into training and testing sets: eighty percent of the data is being used for training and twenty percent for testing. Lastly, the training and testing features were standardized with StandardScaler, which rescales each column to zero mean and unit variance (not to a fixed zero-to-one range), and the encoded (unscaled) data frame was printed using the head function.
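    The encoding and standardization steps can be illustrated on a small hypothetical example. Note that pd.Categorical assigns the code -1 to missing values, which is why NaN cells appear as -1 in the encoded table above, and that StandardScaler centers and rescales each column rather than mapping it onto a fixed range:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical categorical column containing a missing value
s = pd.Series(['x', 'f', None, 'x'])
codes = pd.Categorical(s).codes          # NaN is encoded as -1
print(codes)                             # [1, 0, -1, 1]

# Standardization: zero mean and unit variance, not a 0-1 range
X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(), X_scaled.std())   # ~0.0 and ~1.0
```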

How does cap diameter correlate with stem height or stem width?¶

In [9]:
# Creates a scatter plot and a line of best fit from the data frame 
plt.figure(figsize = (12, 10))
sns.regplot(x = 'cap-diameter', y = 'stem-height', data = df, scatter = False, color = '#d01c8b')
sns.scatterplot(x = 'cap-diameter', y = 'stem-height', data = df, s = 1, color = '#4dac26')
plt.title('Cap Diameter versus Stem Height', fontsize = 16)
plt.xlabel('Cap Diameter (cm)', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
In [10]:
# Creates a scatter plot and a line of best fit from the data frame
plt.figure(figsize = (12, 10))
sns.regplot(x = 'cap-diameter', y = 'stem-width', data = df, scatter = False, color = '#d01c8b')
sns.scatterplot(x = 'cap-diameter', y = 'stem-width', data = df, s = 1, color = '#4dac26')
plt.title('Cap Diameter versus Stem Width', fontsize = 16)
plt.xlabel('Cap Diameter (cm)', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)

    As mentioned before, cap diameter correlates more strongly with stem width than with stem height, but there is a clear positive correlation with both. A scatter plot with a line of best fit was created for each pair, and both lines have a clear positive slope. However, both graphs contain outliers: some mushrooms have small cap diameters but unusually large stem widths or heights, and, inversely, some have large cap diameters but unusually small stem widths or heights. Furthermore, there are some obvious groupings in both graphs. In particular, some mushrooms have no stems at all, and therefore have a stem width and height of zero. These mushrooms may have influenced the correlation table, but given how few there are, it is unlikely they affected it much.
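    The stemless mushrooms' effect on the correlation can be checked directly by recomputing the Pearson coefficient with those rows dropped. A minimal sketch on hypothetical values, not the real data:

```python
import pandas as pd

# Hypothetical cap-diameter / stem-width pairs, including two stemless rows
toy = pd.DataFrame({
    'cap-diameter': [2.0, 4.0, 6.0, 8.0, 3.0, 5.0],
    'stem-width':   [4.1, 8.2, 11.9, 16.0, 0.0, 0.0],
})

# Correlation with and without the zero-width (stemless) rows
r_all = toy['cap-diameter'].corr(toy['stem-width'])
stemmed = toy[toy['stem-width'] > 0]
r_stemmed = stemmed['cap-diameter'].corr(stemmed['stem-width'])
print(r_all, r_stemmed)   # dropping the stemless rows strengthens the correlation
```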

How does stem height correlate with stem width?¶

In [11]:
# Creates a scatter plot and a line of best fit from the data frame
plt.figure(figsize = (12, 10))
sns.regplot(x = 'stem-height', y = 'stem-width', data = df, scatter = False, color = '#d01c8b')
sns.scatterplot(x = 'stem-height', y = 'stem-width', data = df, s = 1, color = '#4dac26')
plt.title('Stem Height versus Stem Width', fontsize = 16)
plt.xlabel('Stem Height (cm)', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)

    The correlation table created previously was correct: there is little to no correlation between stem width and stem height. A scatter plot with a line of best fit overlaid on top shows the points scattered widely with no clear pattern. Some points appear to form groups, which may indicate that they come from the same species, but this cannot be determined definitively. Despite the weak positive trend of the fitted line, mushrooms can clearly have thick but short stems or thin but long stems.

How does cap diameter correlate with cap shape or cap surface?¶

In [12]:
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'cap-shape', y = 'cap-diameter', data = df, palette = 'PiYG')
plt.title('Cap Shape versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Shape', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['Convex', 'Flat', 'Spherical', 'Bell', 'Conical', 'Sunken', 'Others'])
Out[12]:
([<matplotlib.axis.XTick at 0x14f0ad04700>,
  <matplotlib.axis.XTick at 0x14f0ad046d0>,
  <matplotlib.axis.XTick at 0x14f0acfa340>,
  <matplotlib.axis.XTick at 0x14f0ad55460>,
  <matplotlib.axis.XTick at 0x14f0ad55bb0>,
  <matplotlib.axis.XTick at 0x14f0ad5a340>,
  <matplotlib.axis.XTick at 0x14f0ad5aa90>],
 [Text(0, 0, 'Convex'),
  Text(1, 0, 'Flat'),
  Text(2, 0, 'Spherical'),
  Text(3, 0, 'Bell'),
  Text(4, 0, 'Conical'),
  Text(5, 0, 'Sunken'),
  Text(6, 0, 'Others')])
In [13]:
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'cap-shape', y = 'cap-diameter', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'cap-shape', y = 'cap-diameter', data = df, palette = 'PiYG', showfliers = False)
plt.title('Cap Shape versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Shape', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['Convex', 'Flat', 'Spherical', 'Bell', 'Conical', 'Sunken', 'Others'])
In [14]:
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'cap-shape', y = 'cap-diameter', data = df, palette = 'PiYG', inner = None)
plt.title('Cap Shape versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Shape', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6], ['Convex', 'Flat', 'Spherical', 'Bell', 'Conical', 'Sunken', 'Others'])
In [15]:
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'cap-surface', y = 'cap-diameter', data = df, palette = 'PiYG')
plt.title('Cap Surface versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Surface', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ['Shiny', 'Grooves', 'Sticky', 'Scaly', 'Fleshy', 'Smooth', 'Leathery', 'd', 'Wrinkled', 'Fibrous', 'Silky'])
In [16]:
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'cap-surface', y = 'cap-diameter', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'cap-surface', y = 'cap-diameter', data = df, palette = 'PiYG', showfliers = False)
plt.title('Cap Surface versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Surface', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ['Shiny', 'Grooves', 'Sticky', 'Scaly', 'Fleshy', 'Smooth', 'Leathery', 'd', 'Wrinkled', 'Fibrous', 'Silky'])
In [17]:
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'cap-surface', y = 'cap-diameter', data = df, palette = 'PiYG', inner = None)
plt.title('Cap Surface versus Cap Diameter', fontsize = 16)
plt.xlabel('Cap Surface', fontsize = 14)
plt.ylabel('Cap Diameter (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ['Shiny', 'Grooves', 'Sticky', 'Scaly', 'Fleshy', 'Smooth', 'Leathery', 'd', 'Wrinkled', 'Fibrous', 'Silky'])

    Cap shape was compared to cap diameter using a bar plot, a box plot with an overlaid strip plot, and a violin plot. From these plots it is clear that the others category had both the highest average cap diameter and the highest maximum cap diameter. The spherical shape had the second highest average cap diameter but only the third highest maximum. The flat shape overtook the spherical shape in the maximum comparison despite having only the fourth highest average. The sunken shape had the third highest average and fifth highest maximum, the convex shape the fifth highest average and third highest maximum, the conical shape the sixth highest average and seventh highest maximum, and the bell shape the seventh highest average and sixth highest maximum.

    Cap surface was compared to cap diameter using the same three plot types. From these plots it is clear that the fleshy surface had the highest average cap diameter but only the ninth highest maximum. The remaining surface types, ranked by average, are: scaly, silky, smooth, d, sticky, shiny, fibrous, wrinkled, leathery, grooves. Ranked by maximum, they are: scaly, smooth, d, wrinkled, grooves, sticky, silky, shiny, fibrous, leathery.
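    Rankings like these can be computed directly rather than read off the plots. A minimal sketch using groupby on a hypothetical frame with the same column names, chosen so that the ranking by mean differs from the ranking by maximum:

```python
import pandas as pd

# Hypothetical cap shapes and diameters, not the real data
toy = pd.DataFrame({
    'cap-shape': ['x', 'x', 'f', 'f', 'o', 'o'],
    'cap-diameter': [5.0, 21.0, 14.0, 16.0, 9.0, 11.0],
})

# Categories ranked by mean cap diameter, largest first
means = toy.groupby('cap-shape')['cap-diameter'].mean().sort_values(ascending=False)
# Categories ranked by maximum cap diameter, largest first
maxes = toy.groupby('cap-shape')['cap-diameter'].max().sort_values(ascending=False)
print(means.index.tolist())   # ['f', 'x', 'o'] by mean
print(maxes.index.tolist())   # ['x', 'f', 'o'] by max
```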

How does stem root correlate with stem height or stem width?¶

In [18]:
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'stem-root', y = 'stem-height', data = df, palette = 'PiYG')
plt.title('Stem Root versus Stem Height', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
In [19]:
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'stem-root', y = 'stem-height', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'stem-root', y = 'stem-height', data = df, palette = 'PiYG', showfliers = False)
plt.title('Stem Root versus Stem Height', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
In [20]:
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'stem-root', y = 'stem-height', data = df, palette = 'PiYG', inner = None)
plt.title('Stem Root versus Stem Height', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Height (cm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
In [21]:
# Creates a bar plot from the data frame
plt.figure(figsize = (12, 10))
sns.barplot(x = 'stem-root', y = 'stem-width', data = df, palette = 'PiYG')
plt.title('Stem Root versus Stem Width', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
In [22]:
# Creates a box and strip plot from the data frame
plt.figure(figsize = (12, 10))
sns.stripplot(x = 'stem-root', y = 'stem-width', data = df, color = 'k', alpha = 0.01)
sns.boxplot(x = 'stem-root', y = 'stem-width', data = df, palette = 'PiYG', showfliers = False)
plt.title('Stem Root versus Stem Width', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])
In [23]:
# Creates a violin plot from the data frame
plt.figure(figsize = (12, 10))
sns.violinplot(x = 'stem-root', y = 'stem-width', data = df, palette = 'PiYG', inner = None)
plt.title('Stem Root versus Stem Width', fontsize = 16)
plt.xlabel('Stem Root', fontsize = 14)
plt.ylabel('Stem Width (mm)', fontsize = 14)
plt.xticks([0, 1, 2, 3, 4], ['Swollen', 'Bulbous', 'Rooted', 'Club', 'f'])

    Stem root was compared to stem height using a bar plot, a box plot with an overlaid strip plot, and a violin plot. Oddly, the f category appears to lack stems entirely, and therefore has no height or width. Besides the f category, swollen stem roots had the highest average height, followed by rooted, club, and then bulbous stem roots. The only difference when ranking by maximum values is that the order of club and bulbous is flipped.

    Stem root was compared to stem width using the same three plot types. Besides the f category, club stem roots had the highest average width, followed by swollen, rooted, and then bulbous stem roots. The only difference when ranking by maximum values is that the order of rooted and bulbous is flipped. It is interesting that bulbous stem roots have the smallest average width and height.
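    Since the f category contributes only zero heights and widths, it can be filtered out before aggregating to make the comparison cleaner. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical stem roots and heights; 'f' rows have no stems
toy = pd.DataFrame({
    'stem-root': ['s', 's', 'b', 'b', 'f', 'f'],
    'stem-height': [8.0, 10.0, 5.0, 7.0, 0.0, 0.0],
})

# Drop the f category, then rank the remaining categories by mean height
ranked = (toy[toy['stem-root'] != 'f']
          .groupby('stem-root')['stem-height']
          .mean()
          .sort_values(ascending=False))
print(ranked.index.tolist())   # ['s', 'b']
```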

What is the optimal supervised learning algorithm out of these options: logistic regression, linear discriminant analysis, naïve Bayes, support vector machine, classification and regression tree, and k-nearest neighbors?¶

In [24]:
# Creates a list of all of the supervised learning algorithms that will be tested
models = []
models.append(('LR', LogisticRegression(solver = 'liblinear', multi_class = 'ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma = 'auto')))
models.append(('CART', DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))

# Creates a list of all the results of each of the supervised learning algorithms that were tested and their names
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)
    cv_results = cross_val_score(model, X_train, y_train, cv = kfold, scoring = 'accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
LR: 0.845522 (0.001647)
LDA: 0.837908 (0.001529)
NB: 0.328278 (0.004411)
SVM: 0.997974 (0.000575)
CART: 0.998629 (0.000648)
KNN: 0.999775 (0.000143)
In [25]:
# Creates a dictionary of all of the names of the supervised learning algorithms that were tested and their mean accuracies
answers = {}
for i in range(len(names)):
    answers[names[i]] = np.mean(results[i])

# Creates a bar plot from the dictionary
plt.figure(figsize = (12, 10))
plt.bar(height = list(answers.values()), x = list(answers.keys()), color = sns.color_palette('PiYG'))
plt.title('Comparison of Algorithms', fontsize = 16)
plt.xlabel('Algorithm', fontsize = 14)
plt.ylabel('Average Accuracy', fontsize = 14)
Out[25]:
Text(0, 0.5, 'Average Accuracy')

    In order to analyze the quality of the supervised learning algorithms, the algorithms first had to be created. The algorithms tested are: logistic regression, linear discriminant analysis, naïve Bayes, support vector machine, classification and regression tree, and k-nearest neighbors. Each of these algorithms attempted to predict whether a given mushroom would bruise/bleed or not. K-nearest neighbors was the most accurate, with a 99.98% accuracy rate. The classification and regression tree and support vector machine algorithms came very close, with accuracy rates of 99.86% and 99.80%, respectively. The logistic regression and linear discriminant analysis algorithms had a steep drop-off in accuracy, at only 84.55% and 83.79%, respectively. There was another steep drop-off to the naïve Bayes algorithm, which achieved only 32.83% accuracy. All of the algorithms had very low standard deviations, and their results were visualized in a bar chart.
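Cross-validation scores guide the model choice, but the winner should still be confirmed once on the held-out test split. A minimal sketch of that final check, with synthetic data standing in for the mushroom features and bruise/bleed labels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the mushroom features and bruise/bleed labels
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=1)

# Refit the cross-validation winner on the full training split, then
# score it once on the untouched test split
knn = KNeighborsClassifier().fit(X_tr, y_tr)
acc = accuracy_score(y_te, knn.predict(X_te))
print('held-out accuracy:', acc)
print(confusion_matrix(y_te, knn.predict(X_te)))
```

Scoring on data the model never saw during selection guards against the cross-validation estimate being optimistic.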

What is the optimal epoch size for this data set when using an artificial neural network?¶

In [26]:
# Imports used by this cell (assumed not already imported earlier in the notebook)
from tensorflow.keras.layers import Dense
from sklearn.metrics import confusion_matrix

# Adding all of the layers to the artificial neural network
classifier = Sequential()
classifier.add(Dense(activation = 'relu', input_dim = 20, units = 11, kernel_initializer = 'uniform'))
classifier.add(Dense(activation = 'relu', units = 11, kernel_initializer = 'uniform'))
classifier.add(Dense(activation = 'sigmoid', units = 1, kernel_initializer = 'uniform'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Creates a list of all of the fits and results from the artificial neural network when it is fitted with one to ten epochs.
# Note: the same classifier object is re-fitted each iteration, so training accumulates across runs
fits = []
accuracies = []
for i in range(1, 11):
    print('Number of Epochs:', i, '\n\nFitting:')
    fits.append(classifier.fit(X_train, y_train, batch_size = 5, epochs = i, verbose = 2))
    print('\nConfusion Matrix:')
    y_pred = (classifier.predict(X_test, verbose = 0) > 0.5)
    cm = confusion_matrix(y_test, y_pred)
    print(cm, '\n\nAccuracy: ')
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    accuracies.append(accuracy)
    print(accuracy, '\n\n========================================================================\n')
Number of Epochs: 1 

Fitting:
9771/9771 - 26s - loss: 0.1436 - accuracy: 0.9414 - 26s/epoch - 3ms/step

Confusion Matrix:
[[10056    28]
 [  160  1970]] 

Accuracy: 
0.9846078270836744 

========================================================================

Number of Epochs: 2 

Fitting:
Epoch 1/2
9771/9771 - 22s - loss: 0.0263 - accuracy: 0.9911 - 22s/epoch - 2ms/step
Epoch 2/2
9771/9771 - 23s - loss: 0.0153 - accuracy: 0.9950 - 23s/epoch - 2ms/step

Confusion Matrix:
[[10060    24]
 [   13  2117]] 

Accuracy: 
0.9969706893728508 

========================================================================

Number of Epochs: 3 

Fitting:
Epoch 1/3
9771/9771 - 23s - loss: 0.0117 - accuracy: 0.9959 - 23s/epoch - 2ms/step
Epoch 2/3
9771/9771 - 22s - loss: 0.0103 - accuracy: 0.9965 - 22s/epoch - 2ms/step
Epoch 3/3
9771/9771 - 22s - loss: 0.0090 - accuracy: 0.9969 - 22s/epoch - 2ms/step

Confusion Matrix:
[[10058    26]
 [   20  2110]] 

Accuracy: 
0.9962338300311119 

========================================================================

Number of Epochs: 4 

Fitting:
Epoch 1/4
9771/9771 - 23s - loss: 0.0073 - accuracy: 0.9974 - 23s/epoch - 2ms/step
Epoch 2/4
9771/9771 - 23s - loss: 0.0069 - accuracy: 0.9977 - 23s/epoch - 2ms/step
Epoch 3/4
9771/9771 - 21s - loss: 0.0068 - accuracy: 0.9980 - 21s/epoch - 2ms/step
Epoch 4/4
9771/9771 - 21s - loss: 0.0056 - accuracy: 0.9984 - 21s/epoch - 2ms/step

Confusion Matrix:
[[10056    28]
 [    8  2122]] 

Accuracy: 
0.9970525626330441 

========================================================================

Number of Epochs: 5 

Fitting:
Epoch 1/5
9771/9771 - 22s - loss: 0.0052 - accuracy: 0.9985 - 22s/epoch - 2ms/step
Epoch 2/5
9771/9771 - 21s - loss: 0.0049 - accuracy: 0.9985 - 21s/epoch - 2ms/step
Epoch 3/5
9771/9771 - 22s - loss: 0.0042 - accuracy: 0.9986 - 22s/epoch - 2ms/step
Epoch 4/5
9771/9771 - 21s - loss: 0.0051 - accuracy: 0.9985 - 21s/epoch - 2ms/step
Epoch 5/5
9771/9771 - 22s - loss: 0.0045 - accuracy: 0.9987 - 22s/epoch - 2ms/step

Confusion Matrix:
[[10082     2]
 [   19  2111]] 

Accuracy: 
0.9982806615359424 

========================================================================

Number of Epochs: 6 

Fitting:
Epoch 1/6
9771/9771 - 22s - loss: 0.0040 - accuracy: 0.9989 - 22s/epoch - 2ms/step
Epoch 2/6
9771/9771 - 21s - loss: 0.0043 - accuracy: 0.9987 - 21s/epoch - 2ms/step
Epoch 3/6
9771/9771 - 22s - loss: 0.0034 - accuracy: 0.9989 - 22s/epoch - 2ms/step
Epoch 4/6
9771/9771 - 21s - loss: 0.0040 - accuracy: 0.9989 - 21s/epoch - 2ms/step
Epoch 5/6
9771/9771 - 21s - loss: 0.0040 - accuracy: 0.9988 - 21s/epoch - 2ms/step
Epoch 6/6
9771/9771 - 21s - loss: 0.0032 - accuracy: 0.9990 - 21s/epoch - 2ms/step

Confusion Matrix:
[[10083     1]
 [   12  2118]] 

Accuracy: 
0.9989356476174881 

========================================================================

Number of Epochs: 7 

Fitting:
Epoch 1/7
9771/9771 - 21s - loss: 0.0039 - accuracy: 0.9988 - 21s/epoch - 2ms/step
Epoch 2/7
9771/9771 - 22s - loss: 0.0035 - accuracy: 0.9991 - 22s/epoch - 2ms/step
Epoch 3/7
9771/9771 - 21s - loss: 0.0029 - accuracy: 0.9990 - 21s/epoch - 2ms/step
Epoch 4/7
9771/9771 - 22s - loss: 0.0039 - accuracy: 0.9989 - 22s/epoch - 2ms/step
Epoch 5/7
9771/9771 - 21s - loss: 0.0030 - accuracy: 0.9990 - 21s/epoch - 2ms/step
Epoch 6/7
9771/9771 - 20s - loss: 0.0030 - accuracy: 0.9990 - 20s/epoch - 2ms/step
Epoch 7/7
9771/9771 - 22s - loss: 0.0033 - accuracy: 0.9991 - 22s/epoch - 2ms/step

Confusion Matrix:
[[10075     9]
 [   12  2118]] 

Accuracy: 
0.9982806615359424 

========================================================================

Number of Epochs: 8 

Fitting:
Epoch 1/8
9771/9771 - 22s - loss: 0.0024 - accuracy: 0.9993 - 22s/epoch - 2ms/step
Epoch 2/8
9771/9771 - 21s - loss: 0.0033 - accuracy: 0.9992 - 21s/epoch - 2ms/step
Epoch 3/8
9771/9771 - 22s - loss: 0.0024 - accuracy: 0.9992 - 22s/epoch - 2ms/step
Epoch 4/8
9771/9771 - 21s - loss: 0.0025 - accuracy: 0.9993 - 21s/epoch - 2ms/step
Epoch 5/8
9771/9771 - 23s - loss: 0.0023 - accuracy: 0.9993 - 23s/epoch - 2ms/step
Epoch 6/8
9771/9771 - 23s - loss: 0.0030 - accuracy: 0.9992 - 23s/epoch - 2ms/step
Epoch 7/8
9771/9771 - 22s - loss: 0.0031 - accuracy: 0.9993 - 22s/epoch - 2ms/step
Epoch 8/8
9771/9771 - 23s - loss: 0.0024 - accuracy: 0.9993 - 23s/epoch - 2ms/step

Confusion Matrix:
[[10084     0]
 [   30  2100]] 

Accuracy: 
0.9975438021942034 

========================================================================

Number of Epochs: 9 

Fitting:
Epoch 1/9
9771/9771 - 22s - loss: 0.0025 - accuracy: 0.9995 - 22s/epoch - 2ms/step
Epoch 2/9
9771/9771 - 22s - loss: 0.0030 - accuracy: 0.9993 - 22s/epoch - 2ms/step
Epoch 3/9
9771/9771 - 22s - loss: 0.0025 - accuracy: 0.9992 - 22s/epoch - 2ms/step
Epoch 4/9
9771/9771 - 22s - loss: 0.0022 - accuracy: 0.9993 - 22s/epoch - 2ms/step
Epoch 5/9
9771/9771 - 22s - loss: 0.0026 - accuracy: 0.9992 - 22s/epoch - 2ms/step
Epoch 6/9
9771/9771 - 22s - loss: 0.0023 - accuracy: 0.9993 - 22s/epoch - 2ms/step
Epoch 7/9
9771/9771 - 22s - loss: 0.0019 - accuracy: 0.9994 - 22s/epoch - 2ms/step
Epoch 8/9
9771/9771 - 22s - loss: 0.0027 - accuracy: 0.9992 - 22s/epoch - 2ms/step
Epoch 9/9
9771/9771 - 22s - loss: 0.0018 - accuracy: 0.9993 - 22s/epoch - 2ms/step

Confusion Matrix:
[[10083     1]
 [    2  2128]] 

Accuracy: 
0.9997543802194203 

========================================================================

Number of Epochs: 10 

Fitting:
Epoch 1/10
9771/9771 - 22s - loss: 0.0024 - accuracy: 0.9995 - 22s/epoch - 2ms/step
Epoch 2/10
9771/9771 - 22s - loss: 0.0024 - accuracy: 0.9993 - 22s/epoch - 2ms/step
Epoch 3/10
9771/9771 - 22s - loss: 0.0025 - accuracy: 0.9992 - 22s/epoch - 2ms/step
Epoch 4/10
9771/9771 - 22s - loss: 0.0019 - accuracy: 0.9995 - 22s/epoch - 2ms/step
Epoch 5/10
9771/9771 - 22s - loss: 0.0020 - accuracy: 0.9994 - 22s/epoch - 2ms/step
Epoch 6/10
9771/9771 - 21s - loss: 0.0015 - accuracy: 0.9995 - 21s/epoch - 2ms/step
Epoch 7/10
9771/9771 - 22s - loss: 0.0017 - accuracy: 0.9996 - 22s/epoch - 2ms/step
Epoch 8/10
9771/9771 - 22s - loss: 0.0015 - accuracy: 0.9996 - 22s/epoch - 2ms/step
Epoch 9/10
9771/9771 - 23s - loss: 0.0023 - accuracy: 0.9993 - 23s/epoch - 2ms/step
Epoch 10/10
9771/9771 - 22s - loss: 0.0015 - accuracy: 0.9996 - 22s/epoch - 2ms/step

Confusion Matrix:
[[10083     1]
 [   10  2120]] 

Accuracy: 
0.9990993941378745 

========================================================================

In [27]:
# Creates a dictionary of all of the epoch counts that were tested and their accuracies
accuracies_dict = {}
for i in range(len(accuracies)):
    accuracies_dict[i + 1] = accuracies[i]

# Creates a bar plot from the dictionary
plt.figure(figsize = (12, 10))
sns.barplot(y = list(accuracies_dict.values()), x = list(accuracies_dict.keys()), palette = 'PiYG')
plt.ylim(0.98, 1)
plt.title('Comparison of Artificial Neural Network Epoch Count', fontsize = 16)
plt.xlabel('Epoch Count', fontsize = 14)
plt.ylabel('Accuracy', fontsize = 14)
Out[27]:
Text(0, 0.5, 'Accuracy')
In [28]:
# Fits the artificial neural network with nine epochs
print('Number of Epochs: 9\n\nFitting:')
fits.append(classifier.fit(X_train, y_train, batch_size = 5, epochs = 9, verbose = 2))
print('\nConfusion Matrix:')
y_pred = (classifier.predict(X_test, verbose = 0) > 0.5)
cm = confusion_matrix(y_test, y_pred)
print(cm, '\n\nAccuracy: ')
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / (tn + tp + fn + fp)

# Tests the fit on three example mushrooms
print(accuracy, '\n\nTest Mushroom 1:')
new_mushroom = [[1, 0.73, 6, 2, 5, 0, -1, 5, 3.72, 0.97, -1, 2, 4, -1, -1, 0, 1, -1, 0, 1]]
new_mushroom = sc.transform(new_mushroom)
new_prediction = classifier.predict(new_mushroom, verbose = 0)
print(new_prediction, '\n\nTest Mushroom 2:')
new_mushroom = [[0, 13.43, 6, -1, 5, -1, -1, 10, 12.47, 20.63, 0, -1, 11, 0, 4, 1, 2, -1, 0, 2]]
new_mushroom = sc.transform(new_mushroom)
new_prediction = classifier.predict(new_mushroom, verbose = 0)
print(new_prediction, '\n\nTest Mushroom 3:')
new_mushroom = [[0, 9.48, 5, -1, 5, 5, 0, 3, 8.62, 22.77, 0, -1, 10, -1, -1, 0, 1, -1, 1, 3]]
new_mushroom = sc.transform(new_mushroom)
new_prediction = classifier.predict(new_mushroom, verbose = 0)
print(new_prediction)
Number of Epochs: 9

Fitting:
Epoch 1/9
9771/9771 - 12s - loss: 0.0021 - accuracy: 0.9994 - 12s/epoch - 1ms/step
Epoch 2/9
9771/9771 - 15s - loss: 0.0020 - accuracy: 0.9994 - 15s/epoch - 2ms/step
Epoch 3/9
9771/9771 - 11s - loss: 0.0021 - accuracy: 0.9995 - 11s/epoch - 1ms/step
Epoch 4/9
9771/9771 - 11s - loss: 0.0019 - accuracy: 0.9995 - 11s/epoch - 1ms/step
Epoch 5/9
9771/9771 - 11s - loss: 0.0021 - accuracy: 0.9995 - 11s/epoch - 1ms/step
Epoch 6/9
9771/9771 - 11s - loss: 0.0018 - accuracy: 0.9994 - 11s/epoch - 1ms/step
Epoch 7/9
9771/9771 - 11s - loss: 0.0019 - accuracy: 0.9996 - 11s/epoch - 1ms/step
Epoch 8/9
9771/9771 - 11s - loss: 0.0012 - accuracy: 0.9994 - 11s/epoch - 1ms/step
Epoch 9/9
9771/9771 - 11s - loss: 0.0012 - accuracy: 0.9996 - 11s/epoch - 1ms/step

Confusion Matrix:
[[10083     1]
 [    0  2130]] 

Accuracy: 
0.9999181267398067 

Test Mushroom 1:
[[7.344817e-09]] 

Test Mushroom 2:
[[0.99999857]] 

Test Mushroom 3:
[[8.545989e-15]]

    In order to analyze the quality of the artificial neural networks, the network first had to be created. A network with two hidden layers and a batch size of five was fitted once for each epoch count from one to ten; because the same classifier object was re-fitted each iteration, training accumulated across runs, so later epoch counts also benefit from earlier fitting. Each fit attempted to predict whether a given mushroom would bruise/bleed or not. The most accurate fit used nine epochs, with an accuracy of 99.98%. This was nearly the most accurate algorithm created overall, but the k-nearest neighbors algorithm slightly beat it. All of the fits were incredibly accurate, each scoring above 99%, except the one-epoch fit, which was in the 98% range. Interestingly, the eight-epoch fit had zero false positives. All of the results were visualized in a bar chart, and the most accurate fit, with nine epochs, was tested on three example mushrooms. It correctly predicted whether each of these mushrooms bruises or bleeds.
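Because the loop above keeps fitting the same classifier, run i effectively trains on all epochs accumulated so far. A fairer per-count comparison rebuilds the model from fresh weights each time. A sketch of that pattern, with scikit-learn's MLPClassifier and synthetic data standing in for the Keras model and the mushroom features:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the mushroom features and bruise/bleed labels
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

scores = {}
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # hide convergence warnings for tiny max_iter
    for epochs in (1, 5, 10):
        # Fresh weights every iteration, so each epoch count is trained independently
        model = MLPClassifier(hidden_layer_sizes=(11, 11), max_iter=epochs,
                              random_state=0).fit(X_tr, y_tr)
        scores[epochs] = accuracy_score(y_te, model.predict(X_te))
print(scores)
```

In Keras the same effect comes from rebuilding and recompiling the `Sequential` model inside the loop before each `fit` call.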

How will k-means clustering cluster the mushrooms?¶

In [29]:
# Scales the data frame
X_scaled = sc.fit_transform(X)

# Creates a list of all of the WCSS scores on the number of clusters
score_1 = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(X_scaled)
    score_1.append(kmeans.inertia_)

# Creates a line plot from the list (x values start at one to match the cluster counts)
plt.figure(figsize = (12, 10))
plt.plot(range(1, 20), score_1, 'x-', color = '#4dac26')
plt.title('WCSS vs. Number of Clusters', fontsize = 16)
plt.xlabel('Number of Clusters', fontsize = 14)
plt.ylabel('WCSS Score', fontsize = 14)
Out[29]:
Text(0, 0.5, 'WCSS Score')
In [30]:
# Creates bar plots for each of the variables in each of the clusters
kmeans = KMeans(n_clusters = 7, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
labels = kmeans.fit_predict(X_scaled)
X_cluster = pd.concat([X, pd.DataFrame({'cluster': labels})], axis = 1)
for i in X_cluster.columns:
    plt.figure(figsize = (35, 5))
    for j in range(7):
        plt.subplot(1, 8, j + 1)
        cluster = X_cluster[X_cluster['cluster'] == j]
        cluster[i].hist(bins = 20, color = '#4dac26')
        plt.title( '{}\nCluster {}'.format(i, j + 1))
In [31]:
# Creates a principal component analysis of the clusters
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(X_scaled)
pca_X = pd.DataFrame(data = principal_comp, columns = ['pca1', 'pca2'])
pca_X = pd.concat([pca_X, pd.DataFrame({'cluster': labels})], axis = 1)

# Creates a dot plot of the data frame
plt.figure(figsize = (12, 10))
sns.scatterplot(x = 'pca1', y = 'pca2', hue = 'cluster', data = pca_X, palette = 'PiYG', s = 1, edgecolor = 'k', linewidth = 0.05)
plt.title('Clusters', fontsize = 16)
plt.xlabel('PCA1', fontsize = 14)
plt.ylabel('PCA2', fontsize = 14)
Out[31]:
Text(0, 0.5, 'PCA2')

    In order to analyze the clusters made by the k-means clustering algorithm, the algorithm first had to be created. To choose the number of clusters, an elbow plot was made using a line plot. The elbow was determined to be at seven, and therefore seven clusters were used. When analyzing the bar plots created for each of the variables in each of the clusters, it is not entirely clear how the algorithm grouped them. However, there are some standouts. In particular, every mushroom in cluster six has a universal veil, while every mushroom in the other clusters has no veil at all. Furthermore, cluster two had the widest range of all of the numeric data: the tallest stems, the widest stems, and the widest caps were all in cluster two. Lastly, a scatter plot was created from a principal component analysis of the clusters to visualize their groupings.
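The elbow read can be cross-checked with the silhouette score, which rewards tight, well-separated clusters; the k that maximizes it is another defensible choice. A sketch on synthetic blobs standing in for the scaled mushroom features:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs standing in for the scaled features
X_demo, _ = make_blobs(n_samples=600,
                       centers=[[0, 0], [8, 8], [0, 8], [8, 0]],
                       cluster_std=0.8, random_state=0)

# Silhouette score for each candidate cluster count
sil = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    sil[k] = silhouette_score(X_demo, km.labels_)
best_k = max(sil, key=sil.get)
print('silhouette by k:', sil, '-> best k =', best_k)
```

Unlike the elbow, which requires eyeballing a bend in the WCSS curve, the silhouette gives a single number per k, which makes the choice less subjective.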

Conclusion¶

    In conclusion, few correlations were found between the numeric values in this data set. The primary correlation found was between cap diameter and stem width; cap diameter was not strongly correlated with stem height, and stem height was not strongly correlated with stem width. This makes sense: a wider stem is needed to hold up a wider cap, but a longer stem is not, and stem width and height do not affect each other. It was also determined that the others category of cap shapes has the widest caps on average, while the bell cap shape has the narrowest. Furthermore, the fleshy cap surface was shown to have the highest average cap diameter among the cap surface types, while the grooved surface has the lowest. Swollen roots clearly have the tallest stems of all the roots, but club roots have the widest stems.

    In terms of machine learning, it has become clear that supervised learning algorithms are better suited to this data set. The unsupervised k-means clustering algorithm clustered the mushrooms without any obvious interpretation, but most of the supervised learning algorithms succeeded in predicting whether a mushroom would bruise/bleed or not. In particular, the k-nearest neighbors algorithm performed best, with the nine-epoch artificial neural network not far behind. None of the supervised learning algorithms performed particularly badly besides naïve Bayes, at 32.83% accuracy. The logistic regression and linear discriminant analysis algorithms were not quite as good as the others, but both were still about 85% accurate. The artificial neural networks, support vector machine, classification and regression tree, and k-nearest neighbors algorithms were all incredibly successful.